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The classification of high-throughput sequencing data of protein-encoding genes is not as 
well established as for 16S rRNA. The objective of this work was to develop a simple and 
accurate method of classifying large datasets of pmoA sequences, a common marker 
for methanotrophic bacteria. A taxonomic system for pmoA was developed based on 
a phylogenetic analysis of available sequences. The taxonomy incorporates the known 
diversity of pmoA present in public databases, including both sequences from cultivated 
and uncultivated organisms. Representative sequences from closely related genes, such 
as those encoding the bacterial ammonia monooxygenase, were also included in the 
pmoA taxonomy. In total, 53 low-level taxa (genus-level) are included. Using previously 
published datasets of high-throughput pmoA amplicon sequence data, we tested two 
approaches for classifying pmoA: a naive Bayesian classifier and BLAST Classification of 
pmoA sequences based on BLAST analyses was performed using the lowest common 
ancestor (LCA) algorithm in MEGAN, a software program commonly used for the analysis 
of metagenomic data. Both the naive Bayesian and BLAST methods were able to 
classify pmoA sequences and provided similar classifications; however, the naive Bayesian 
classifier was prone to misclassifying contaminant sequences present in the datasets. 
Another advantage of the BLAST/LCA method was that it provided a user-interpretable 
output and enabled novelty detection at various levels, from highly divergent pmoA 
sequences to genus-level novelty. 
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1. INTRODUCTION 

High-throughput sequencing (HTS) technologies have aided our 
ability to investigate the diversity of microorganisms in envi- 
ronmental samples either by shotgun metagenomic or amplicon 
sequencing approaches. Many bioinformatic tools necessary to 
process and interpret the large volume of data obtained by HTS 
have been developed. For example, there are several choices of 
pipelines available to analyze 16S rRNA amplicon sequencing data 
such as RDP (Cole et al, 2005), mothur (Schloss et al, 2009) and 
QIIME (Caporaso et al., 2010). Similar strategies targeting genes 
encoding enzymes responsible for important biogeochemical or 
bioremediation processes are becoming more common, but the 
methods for analyzing the data are not as well established as for 
16S rRNA. 

The analysis of HTS amplicon data can be performed 
using taxonomy-dependent or independent approaches. The 
taxonomy-independent approach includes methods to com- 
pare sequence alignments and analyze operational taxonomic 
units (OTUs) based on sequence dissimilarity (Schloss and 
Handelsman, 2004; Cai and Sun, 2011). This approach is valu- 
able for estimating ecological parameters, such as richness and 
diversity; however, the information on its own does not indicate 



how the sequences are related to those of cultivated organisms 
or those from other studies. Taxonomy-based methods classify 
sequences according to their relatedness to those of pure cultures 
and uncultivated organisms. This approach is necessary to incor- 
porate knowledge of the physiological characteristics of different 
taxa, to identify novel sequence types and to compare results 
between published studies. Common methods for classification 
include naive Bayesian classifiers (Wang et al., 2007), k-nearest 
neighbor (Cole et al., 2005) and BLAST (Altschul et al., 1990). 

In general, the analysis of OTUs and phylogenetic trees cal- 
culated from individual datasets of protein-encoding genes can 
be performed with the same tools designed for the analysis of 
16S rRNA sequences. In contrast, classifiers must be tailor made 
for each gene by establishing a taxonomy with representative 
sequences and choosing an appropriate classification algorithm. 
The objective of this study was to establish a robust and eas- 
ily applied approach to classifying HTS amplicon sequences of 
pmoA, a key gene of methane -oxidizing bacteria. The method 
should also allow for novelty detection and be easily performed by 
a microbial ecologist with only fundamental knowledge of bioin- 
formatics. We test a naive Bayesian classifier and BLAST com- 
bined with the lowest common ancestor approach of MEGAN 
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using previously published pmoA pyrosequencing data (Luke and 
Frenzel, 2011; Deng et al., 2013). Previous studies have also com- 
pared both approaches for the classification of SSU rRNA (Lanzen 
et al, 2012) and fungal LSU rRNA sequences (Porter and Golding, 
2012). 

2. pmoA TAXONOMY 

An accurate taxonomic system for the gene sequences is a nec- 
essary prerequisite for classification. Since the classification of 
unknown sequences obtained by HTS can only be as accurate 
as the taxonomy, the analysis of database sequences and assign- 
ment of taxa is the critical step in the development of a classifier. 
In general, pmoA has been shown to be a good phylogenetic 
marker for methanotrophs (Degelmann et al, 2010), with some 
exceptions of divergent additional copies of the gene in some 
organisms (Dunfield et al, 2002; Stoecker et al, 2006; Baani and 
Liesack, 2008). Here we describe the taxonomy of pmoA genes 
(Table 1); earlier versions were described previously (Luke and 
Frenzel, 2011; Deng et al, 2013). 

2.1. OVERALL TAXONOMIC SYSTEM 

The pmoA gene encodes the P-subunit of the particulate 
methane monooxygenase (pMMO), which belongs to the 
class of copper-containing membrane-bound monooxygenase 
(CuMMO) enzymes. In addition to the pMMO, this group 
includes the bacterial ammonia monooxygenase (Holmes et al., 
1995), the thaumarchaeal ammonia monooxygenase (Pester et al., 
2011), alkane monooxygenases and various uncharacterized 
enzymes encoded by genes detected in environmental surveys 
(Coleman et al., 2012). For our classifier we compiled a database 
of pmoA and related gene sequences obtained primarily from 
public databases. We focused on building a taxonomic structure 
for pmoA, but also included sequences of related genes that are 
often co-amplified with common pmoA primers, such as the bac- 
terial amoA. Related sequences that are not co-amplified, such as 
the thaumarchaeal amoA, were not included. 

Currently, our curated database includes 6628 reference 
sequences corresponding to 53 low-level taxa (Table 1). The 
assignment to taxa was determined by the phylogenetic analysis 
of the pmoA and related gene fragments using both the nucleotide 
and inferred protein sequences. Sequences were imported into 
ARB (Ludwig et al., 2004) and alignments of either 408 nucleotide 
or 136 amino acid residues were used to generate neighbor- 
joining (NI) and maximum-likelihood (ML) trees. For ML trees, 
sequences were exported and uploaded to the RAxML web-server 
(Stamatakis et al, 2005). Tree topologies were compared and 
taxa were assigned according to groups of sequences that con- 
sistently clustered together in the analyses (Luke and Frenzel, 
2011). At the highest level, the sequences were categorized as 
MOB_like or AOB_like, depending on apparent relatedness to 
sequences from methane-oxidizing and ammonia-oxidizing bac- 
teria respectively. The classifier currently contains 53 low-level 
taxa within the MOB_like group (Table 1). Taxa comprising culti- 
vated methanotrophs were referred to as the respective genera or 
species (e.g., Mbacter, for Methylobacter-like pmoA). Taxa lack- 
ing isolates were named according to representative clones or to 
the environment in which they were predominantly or initially 



found (e.g., Aquifer_cluster or upland soil cluster — USC) (Luke 
and Frenzel, 2011). 

2.2. TYPE I AND II pmoA SEQUENCES 

The MOB_like sequences were assigned to either Type I, Type II 
or pXMO_like. The Type I sequences were further divided into 
Type la, b, or c. Type la zxtpmoA sequences affiliated to the classic 
Type I methanotrophs (i.e., not Type X). Type lb (also referred to 
elsewhere as Type X) are those of Methylococcus and closely related 
genera. Type Ic are all other Type I-related sequences with a more 
ambiguous affiliation. Type II sequences were divided into Type 
Ha and b. Type Ila was used for the primary pmoA sequences of 
the Methylocystaceae. Type lib was used to group all other Type II- 
related (i.e., Alphaproteobacteria) sequences, including those from 
the Beijerinckiaceae (Theisen et al, 2005; Dunfield et al., 2010; 
Vorobev et al. , 20 1 1 ) and the alternate pMM02 identified in some 
Methylocystis species (Dunfield et al., 2002; Baani and Liesack, 

2008) . 

2.3. pXMO: DIVERGENT pmoA SEQUENCES 

We use pXMO as the third category of pmoA-related sequences. 
The original description of pXMO was for the unusual pMMO- 
like enzyme identified in some Type I methanotrophs (Tavormina 
et al., 2011). Here we use pXMO_like to encompass all the diver- 
gent sequence types for which the substrate or biological function 
has not been clearly identified by biochemical or genetic tests. 
For example, we include the three verrucomicrobial pmoA-like 
sequences in this category until it is determined which, if not 
all, catalyze the oxidation of methane. The original pxmA genes 
identified in Methylomonas spp. (Tavormina et al., 2011) are clas- 
sified in the M84_P105 low-level taxon. We have also included the 
pmoA sequences from the nitrite-dependent anaerobic methane 
oxidizers belonging to the NC10 phylum (Ettwig et al., 2009, 
2010) into the pXMO_like category; it should be noted that these 
NC10 pmoA sequences are typically retrieved only after using 
specific primers and a special PCR program designed for their 
amplification (Luesken et al., 2011) and therefore are unlikely to 
be obtained in HTS pmoA surveys using the traditional pmoA 
primer sets. 

2.4. BACTERIAL AMMONIA MONOOXYGENASE 

Bacterial ammonia monooxygenase (amoA) genes were included 
since they are commonly co-amplified with pmoA genes in envi- 
ronmental PCR surveys. The amoA sequences of betaproteobac- 
terial ammonia oxidizers were designated AOB_like, without 
making further lower-level distinctions. In contrast, the amoA 
sequences of Nitrosococcus were classified as "Ncoccus" within the 
MOB_like group since they are more closely related to pmoA than 
to other amoA genes. 

3. SOFTWARE AND ASSOCIATED FILES 

Mothur version 1.29.2 (Schloss et al, 2009) and MEGAN ver- 
sion 4.70.4 (Huson et al, 2011) were both downloaded from the 
author's webpages. Standalone BLAST 2.2.26+ (Camacho et al., 

2009) was obtained from NCBI. These software programs can be 
installed on various platforms, but all analyses in this study were 
performed on a quad-core Apple MacBook Pro with a 2.2 GHz 
Intel Core i7 processor and 16 Gb of memory. 
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Table 1 | Description of the pmoA database. 



Classification 


MMO group designation 


Cultivated representative 


Database size 


Sequence 


Typical habitats 


level 






(sequences) 


representative 


or origin 
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Cu-containing membrane 
Monooxygenase (CuMMO) 
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0.1.1 


Type! 
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U. I. I. I 


Type la 










0.1.1.1.1 
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Methylobacter tundripaludum SV96 


552 


A I A 1 A CCO 

AJ4 l4bbo 


Arctic wetland 


0.1.1.1.2 


M m i c ro b i u m_j a p 


Methylomicrobium japanense 


28 


ADO COOC1 

AbZbJJb / 


Marine mud 


0.1.1.1.3 


Mmicrobium_pel 


Methylomicrobium pelagicum IR1 


25 


UJIbbz 


Upland soils 


0.1.1.1.4 


Mmonas 


Methyiomonas methanica 


396 


UJIbbJ 


Lake sediments 


0.1.1.1.5 


Msarcina 


Methyiosarcina quisquilarum 


517 


AN //Jzb 


Landtill soil 


0.1.1.1.6 


Msoma 


Methylosoma difficile LLz 


4 


U(Jl1yU4/ 


Lake sediment 


0.1.1.1.7 


Mbacter_or_Mmonas 




2 


AYzJoU/o 


Movile cave 
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Deep_sea_1 




46 
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C IOCOO 1 c 
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La n df i I l_c I u ste r_2 




/I 
4 


CI I07£ 117 

tuz / b 1 1 / 


Landfill soil 


0.1.1.1.12 


Lake_cluster_1 




72 


EF623667 


Freshwater lakes 


0.1.1.1.13 


RPC_2 




148 


FN600101 


Rice field soil 


0.1.1.1.14 


PS_80 




8 


AF211872 


Marine 


0.1.1.1.15 


Aquifer_cluster 




53 


AM410175 


Aquifer 


0.1.1.2 


Type lb 










0.1.1.2.1 


Mcaldum 


Methyiocaldum tepidum 


98 


U89304 


Agricultural soil 


0.1.1.2.2 


Mcoccus 


Methyiococcus capsulatus Bath 


244 


L40804 


Aquatic 
environments 


0.1.1.2.3 


Mthermus 


Methyiothermus thermalis 


36 


A loinmn 

AJozyuiu 


Hot spring 


0.1.1.2.4 


JnL_4 


Methyiogaea oryzae 


Z / 


lu Jbyuuz 


Rice field soil 


0.1.1.2.5 


Deep_sea_4 




26 


c* I I co a o on 
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Marine deep sea 


0.1.1.2.6 


Deep_sea_5 
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n 1 1 9 o 
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A ROOOQQ 1 
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Rice field soil 


0.1.1.2.9 


Lake_cluster_2 
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Freshwater lakes 


n 1 1 o 1 r\ 
(J. I. l.Z. IU 
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83 


UUUb /(Jby 


Freshwater lakes 


n 1 1 o 11 
(J. I. l.Z. I I 


ncr 
UbL 




1 o 

I o 


A I01700Q 

aj j i /yzo 


Organic soil 


0 110 10 

U. I. I.z. IZ 


HrL_ I 
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Rice field soil 
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RPPc 
nrLS 




I DO 
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Rice field soil 


n 1 1 o 
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Type I c 
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QQ 
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Marine 


0.1.1.3.2 


UbLg 




185 


A IC7DCC1 
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Upland soils 


0.1.1.3.3 


I DO 
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68 
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Upland soils 


0.1.1.3.4 


I DO 

Jn J 




bb 


avcc/i 7no 
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Upland soils 


n 1 o 
U. l.Z 


Type 1 1 










0.1.2.1 


Typella 










n 1 o 1 1 
U. l.Z. I. I 


Msinus 


Methyiosmus tnchosporlum 33/1 


"7n 
/U 


Aj4oyuu / 


Various 


f\ 1 O 1 o 
U. l.Z. l.Z 


Mcystis 


Methyiocystis sp. strain SC2 


1085 


A I A O 1 OOC 

AJ4J I Job 


Various 


n 1 o 1 o 

U. l.Z. I.J 


Msinus_Mcystis 


Methyiosmus tnchosporlum str. KS21 


"7Q 

/y 


A M 0 1 OQQ 
AJ4J loot) 


Various 


0.1.2.2 


Typellb 










0.1.2.2.1 


Mcapsa 


Methylocapsa acidiphila B2 


27 


AJ278727 


Sphagnum bog 


0.1.2.2.2 


M03 




23 


AF283229 


Landfill soil 


0.1.2.2.3 


pmoA2 


Methyiocystis sp. strain SC2 


45 


AJ431387 


Various 


0.1.2.2.4 


USCa 
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AF148521 


Upland soils 


0.1.2 


pXMOJike 










0.1.2.1 
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(Continued) 
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Table 1 | Continued 



Classification 


MMO group designation 


Cultivated representative 


Database size 


Sequence 


Typical habitats 


level 






(sequences) 


representative 


or origin 


0.1.2.1.1 


Verr_1 


Methylacidiphilum infernorum 


3 


EU223859 


Geothermal soil 


0.1.2.1.2 


Verr_2 


Methylacidiphilum infernorum 


3 


EU223862 


Geothermal soil 


0.1.2.1.3 


Verr_3 


Methylacidiphilum infernorum 


3 


EU223855 


Geothermal soil 


0.1.2.1.4 


TUSC 




101 


AJ868282 


Various 


0.1.2.1.5 


NC10 


Cand. Methylomirabilis oxyfera 


33 


JX262154 


Freshwater 
sediment 


0.1.2.2 


nAz l_IIKe 










0.1. z.z.1 


RA21 




157 


Ar148bzz 


Rice field soil 


0. I.z.z.z 


t. A o /l DOT 

Mo4_rzz 




9 


A nnnnco 

AJzyyybJ 


Rice field soil 


n 1 o o o 
0. I.z.z.o 


gp23 




1 


Arzo4 13/ 


Upland soils 


0. I.Z.Z.4 


Alkane_1 


Methylococcaceae ET-SHO 


2 


Ab4o3yb I 


Marine 


0.1.2.2.5 


Alkane_z 


Methylococcaceae hl-rllnu 


2 


Ad45396z 


Marine 


0.1.2.2.6 


MR1 




7 


AF200729 


Upland soils 


0.1.2.3 


M84_P105_like 










0.1.2.3.1 


M84_P105 


Methylomonas methanica 


34 


EU722433 


Various 


0.1.2.4 


Crenothrixjike 










0.1.2.4.1 


Crenothrix 


Crenothrix polyspora (enrichment) 


69 


DQ295904 


Freshwater 


0.1.2.4.2 


Crenothrix_rel 




160 


AJ868245 


Various 


0.2 


AOBJike 


Nitrosospira multiformis 


206 


AF042171 


Various 



3.1. NAiVE BAYESIAN CLASSIFIER 

The training set for the naive Bayesian classifier consists of two 
files: the database sequences in "fasta" format, and a taxonomy 
file indicating the taxonomy of each sequence. The taxonomy 
file was formatted to be compatible with mothur; both files are 
available in the supplement. The training set files were gener- 
ated by exporting sequence information from ARB (Ludwig et al., 
2004) and formatting the entries using standard spreadsheet and 
text-editing programs. 

3.2. BLAST AND MEGAN 

The BLAST database was generated from the taxonomy using 
the makeblastdb program included with BLAST 2.2.26+ package. 
The input was a fasta file with the sequence name header includ- 
ing the sequence accession number and the taxon in square brack- 
ets; as for the naive Bayesian classifier, these files were made using 
common spreadsheet and text-editing software, makeblastdb out- 
puts three files (.nsq,.nin, and.nhr); all files are provided in the 
supplementary material. 

A Newick format tree corresponding to the pmoA taxonomy 
was written for MEGAN; the tree file (pmoa.megan.2013.tre) 
and a corresponding map file (pmoa.megan.2013.map) are pro- 
vided in the supplement. The pmoA taxonomy is loaded into 
MEGAN by the option "use alternative taxonomy" and selecting 
the pmoa.megan.2013.tre file. 

4. pmoA AMPLIFICATION AND SEQUENCING 

Two primer sets are typically used to amplify pmoA sequences 
from environmental samples (McDonald et al, 2008). The 
A189f/A682r primer pair offers broad specificity covering 
many CuMMO monooxygenases (Holmes et al, 1995). The 
A189f/mb661r combination was designed to be more specific 



for pmoA (Costello and Lidstrom, 1999) and does not gener- 
ally amplify amoA or pxmA-\\kt sequences. For NGS amplicon 
sequencing, adaptors and barcodes are incorporated into the 
primers, or ligated onto the PCR products, in a manner compat- 
ible with the sequencing platform (Binladen et al., 2007; Berry 
etal, 2011). 

Two previously described HTS pmoA amplicon datasets were 
used in this study to test the classification methods (Luke 
and Frenzel, 2011; Deng et al., 2013). Both studies used the 
A189f/A682r and A189f/mb661r primer sets. The Luke and 
Frenzel (2011) study focused on rice field soil samples from Italy 
and China and the sequences were analyzed in the original study 
by constructing phylogenetic trees and calculating OTUs from 
sequence alignments. The Deng et al. (2013) study focused on 
peatland samples from both submerged (hollow) and elevated 
(hummock) sites in the Qinghai-Tibetan plateau, and were ana- 
lyzed in the original study by sequence similarity and an earlier 
implementation of the naive Bayesian classifier. The basic steps 
for classifying pmoA sequence data are summarized in Figure 1 
and a detailed protocol is included in the supplement. 

5. RAW SEQUENCE PROCESSING 

Raw sequence data must first undergo basic processing. In this 
study we use mothur, but other software can be used for this pur- 
pose, such as QIIME (Caporaso et al, 2010), RDP (Cole et al, 
2009) and Funframe (Weisman et al., 2013); each of these have 
unique features that might be beneficial for a particular dataset or 
objective. The basic necessary steps are the sorting of sequences 
according to barcodes, trimming and quality filtering. For data 
analyzed here we use a minimum sequence length of 300 bp 
and remove sequences with ambiguities or stretches of more 
than eight homopolymers. Chimeric sequences were identified in 
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Preprocess sequence data 
(Section 5) 



-> Sort according to barcodes, reduce 
sequencing error, trim barcodes & primers, 
remove chimeras & short sequences 



Classify using the naive Bayesian 
classification method (Section 6.1) 



-> Performed with mothur and is a rapid 
alternative to generating BLAST alignments 



Classify using BLAST alignments 
(Section 6.2) 



-> Align sequences against a curated pmoA 
database and import into MEGAN to determine 
diversity within and between samples. Also: 



Identify highly divergent CuMMO 
sequences (Section 7.1.1) 

-> Identified as sequences that cannot be 
aligned to the database by MEGABLAST, 
but produce TBLASTX alignments with 
scores > 50 bits 



Identify moderately divergent 
CuMMO sequences (Section 7.1.2) 

-> Adjust the cutoff in MEGAN to identify 
sequences with MEGABLAST scores 
between 50 and 150 bits 



Identify new pmoA clades 
(Section 7.1.3) 

-> Identified as sequences assigned 
to intermediate nodes using an LCA 
cutoff of 5% in MEGAN 



Identify divergence within pmoA taxa 
(Section 7.1.4) 

-> Examine the alignments to identify 
sequences with conserved mismatches against 
the closest relatives in the database 



FIGURE 1 | Overview of the basic procedure for classifying pmoA 
sequences obtained by high-throughput sequencing. The manuscript 
section where a procedure is described is indicated and detailed 
instructions are available in the supplementary materials. 



mothur using the uchime method (Edgar et al, 2011). Chimeras 
were detected de novo by using the "self" option in mothur, 
meaning that the pmoA pyrosequencing dataset was used as a ref- 
erence. As an example, in the HYa-1 dataset (Deng et al., 2013), 
1058 chimeras were identified in 7658 sequences. These sequences 
were removed using the remove. seqs command in mothur (see 
supplementary methods). 

NGS sequence technologies have relatively large error rates 
compared with Sanger sequencing and there are various 
approaches available to denoise pyrosequencing data (Reeder 
and Knight, 2010; Quince et al, 2011; Rosen et al, 2012). If 
not removed, these errors generate false OTUs (Schloss et al., 
2011). Another problem is that they often cause frameshift 
errors in protein-coding genes, making it difficult to infer 
amino acid sequences. There are methods available to specif- 
ically correct frameshift errors in functional gene pyrose- 
quencing datasets, such as the FrameBot tool (http://fungene. 
cme.msu.edu/FunGenePipeline/) and HMM-FRAME (Weisman 
et al., 2013). Correcting the frameshift errors in pyrosequencing 
datasets is particularly important for calculating OTUs or phy- 
logenetic distances based on amino acid sequences. In general 
we did not find denoising necessary for classifying our pmoA 
sequences obtained by pyrosequencing and therefore do not dis- 
cuss the methods here; however, a study on fungal LSU rRNA 
amplicon sequencing reported that the BLAST/LCA method was 
less sensitive to sequence error than the naive Bayesian classifica- 
tion approach (Porter and Golding, 2012). 

6. CLASSIFICATION OF pmoA SEQUENCES 

6.1. NAiVE BAYESIAN CLASSIFIER 

A naive Bayesian classifier is the basis of the RDP classifier for 
16S rRNA sequence data (Wang et al., 2007). In this study we 
have adopted the 16S rRNA classifier implemented in mothur, but 
replaced the 16S rRNA taxonomy with that ofpmoA. The process 
in mothur is invoked with classify.seqs command with the option 
of the "wang" method. Other variables include the kmer (word) 
size and the cutoff value for bootstrap confidence estimates. In 
general we found that the default kmer size of 8 performed well 
and used a cutoff value of 80%. 

6.2. BLAST/LCA 

Different versions of BLAST were tested. Nucleotide BLAST 
types include MEGABLAST, BLASTN and discontiguous (DQ- 
MEGABLAST, in decreasing order of sensitivity for obtain- 
ing matches and alignments against distantly related sequences 
(McGinnis and Madden, 2004). In most cases the results of the 
different methods were nearly identical, except in some cases 
that DC-MEGABLAST would produce hits (alignments) with dis- 
tantly related novel pmoA sequences that were not aligned by the 
other two methods. We found that the additional computation 
time required for DC-MEGABLAST was not compensated by the 
added sensitivity since this could be easily recovered by a reanaly- 
sis of the sequences that did not produce hits with MEGABLAST, 
as described in section 7.1.1. Protein BLAST approaches were 
also tested and classification results were similar to nucleotide 
BLAST, but with greater sensitivity for matching distantly related 
novel sequence types to the database. The protein BLAST searches 
were more vulnerable to the effect of frameshift errors in the 
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FIGURE 2 | Comparison of classifications of paddy soil pmoA sequence 
data (Luke and Frenzel, 2011) using the naive Bayesian and 
BLAST/LCA methods. The datasets were generated from three different 
soils (young Chinese, old Chinese, and Italian) and with two different PCR 
primer combinations (A189f/A682r, A189f/mb661 r) as indicated. The 
number of sequences assigned to each taxon is plotted. Only pmoA taxa 
detected in at least one dataset are shown. 



query sequences because they cause breaks in the alignment that 
strongly affect the bit score of a match. In contrast, the insertion 
of gap during nucleotide BLAST adds a small penalty to the bit 
score and does not terminate the alignment. Therefore, in general 
we use MEGABLAST, which is the default algorithm for the blastn 
program. 

BLAST has been shown to be relatively poor at identifying the 
most similar sequence in a dataset (Koski and Golding, 2001; Cole 
et al., 2005). We did not observe, nor do we foresee, this to be a 
problem for the classification since it is only necessary to identify 
the most similar taxon, which are all highly distinct (>5% nucleic 
acid identity) from one another. 

6.2. 1. BLAST interpretation by lowest common ancestor in MEGAN 

MEGAN was developed for the classification of metagenomic 
sequence data by reading the output of BLAST queries (Huson 
et al., 2011; Mitra et al., 2010). It has been adapted for other pur- 
poses, for example it can read the log file of a sina alignment 
of 16S rRNA (Pruesse et al, 2012) and thereby used to analyze 
16S rRNA sequence data (Mitra et al, 2011). Here we show that 
it could be adapted for the analysis of the pmoA BLAST queries 
against our taxonomy database, as has been demonstrated previ- 
ously for SSU rRNA (Lanzen et al., 2012) and fungal LSU rRNA 
(Porter and Golding, 2012). A simple modification is possible in 
MEGAN to change the default NCBI taxonomy to a custom tax- 
onomy, in this case pmoA. MEGAN parses the BLAST output file 
and collects only the top hits from each taxon and the associated 
alignment. In addition to summarizing the results, this has the 
added benefit of reducing the file size compared with the original 
BLAST output. 

MEGAN uses a lowest common ancestor (LCA) algorithm 
(Huson et al, 2007) based on BLAST bit scores to classify the 
sequences. A sequence is classified at a particular level only when 
the bit score to the taxon is higher by a given margin than to those 
of any another taxon. The margin of difference can be adjusted in 
the LCA parameters of MEGAN with the option of "top percent." 
The greater the margin, the greater is the minimum distance 
between the assigned taxon and any other taxon. The BLAST/LCA 
method provides several valuable benefits for the classification in 
comparison to simply classifying based on top hit, for example in 
novelty detection as discussed below. 

6.3. COMPARISON OF THE NAiVE BAYESIAN AND BLAST/LCA 
CLASSIFICATIONS 

We found good agreement in the classification of the rice paddy 
soil datasets using the naive Bayesian and BLAST/LCA approaches 
(Figure 2). Subsamples from each classification were analyzed in 
ARB by NJ and in general confirmed the assignments (results 
not shown). Some minor differences between the methods were 
also observed. For example, seven sequences in the China (old) 
A189f-A682r dataset were classified as gp23 whereas these seven 
sequences did not produce significant hits using BLAST/LCA. A 
close inspection indicated that the seven sequences were either 
highly divergent pmoA or non-specific PCR products related to an 
alpha-glucan branching protein or a peptidase. An analysis of the 
Riganqiao samples also showed a tendency for the naive Bayesian 
classifier to assign contaminant sequences to gp23 (not shown). 



This is likely a result of gp23 only being represented by a sin- 
gle sequence in the database and being relatively divergent from 
other pmoA taxa (Luke and Frenzel, 201 1). This erroneous classi- 
fication of non-target sequences as gp23 using the naive Bayesian 
classifier could be reduced by increasing the kmer size to 10, 
but at a cost of decreased sensitivity in classifying bona fide 
pmoA sequences. In spite of this, both the naive Bayesian and 
BLAST/LCA methods identified genuine gp23 sequences present 
in the China (young) A189f-A682r dataset (Figure 2). Previous 
studies also reported higher accuracies of classifications obtained 
by BLAST/LCA for 16S rRNA sequences (Lanzen et al, 2012), 
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fungal LSU rRNA sequences (Porter and Golding, 2012) and 
rRNA internal transcribed spacer sequences (Porter and Golding, 
201 1). However, one clear advantage of the naive Bayesian classi- 
fier was speed; on our system it could classify thousands of pmoA 
sequences per second compared to approximately 200 sequences 
per min for the MEGABLAST query. 

7. NOVELTY DETECTION 

Novel sequences can be identified by the naive Bayesian classifier 
if they cannot be classified. For example, a novel taxon within the 
Type Ib's should be returned as "unclassified Type lb." It is possi- 
ble to adjust this further by specifying a percent cutoff value for 
a bootstrapped analysis. In general, we found this method to be 
unreliable as even contaminant sequences were classified as gp23 
with >80% bootstrap values, as discussed above. 

7.1. NOVELTY DETECTION USING BLAST/LCA IN MEGAN 

The BLAST/LCA procedure offers approaches for novelty detec- 
tion at various levels, which are described individually in the 
following sections. 

7. 1. 1. Identification of highly divergent CuMMO sequences 

Deeply-branching novel sequences did not produce hits to the 
database, which is the first indication that it is potentially a novel 
sequence clade. This occurs in BLAST queries when the sequence 
does not contain a match of the minimum word size, which is 
28 nucleotides in MEGABLAST. For example, this was the case 
in the Riganqiao dataset with the novel clade that was termed 
HY-3 (Deng et al., 2013). In contrast to MEGABLAST, these 
sequences could be identified as pmoA by a translated BLAST 
query (TBLASTX). The difference is that protein sequences are 
more conserved than nucleotide sequences and BLASTX uses a 
word size of 3 amino acids compared with a minimum word 
size of 7 nucleotides for BLASTN. Therefore, the first step in 
novelty detection is to select the sequences that do not produce 
MEGABLAST hits and to query them against the database using 
TBLASTX; in our experience, highly divergent novel clusters 
will produce significant hits (>50 bits) with TBLASTX, whereas 
unrelated contaminant sequences will not. 

7. 1.2. Identification of moderately divergent CuMMO sequences 

Moderately divergent novel sequence clades can be identified by 
a relatively low MEGABLAST bit score. The bit score cutoffs can 
be adjusted in MEGAN and here we used a threshold of 150 bits 
to identify new clades. An example of a novel sequence that could 
be identified in this manner was the I141NRXW sequence from 
the paddy soil (Luke and Frenzel, 2011), which had a top score 
of 89.8 bits. In comparison, the sequence with the next lowest bit 
score in that dataset had a value of 374, and maximum bit scores 
approached 900. 

7. 1.3. Lowest common ancestor classification of sequences 

The next level of novelty can be identified using the lowest 
common ancestor (LCA) algorithm in MEGAN. For the HYa- 
1 dataset, assignments to higher nodes could be seen at the 
Type II (15 sequences), Type lb (9 sequences) and Type Ha (151 
sequences) taxonomic levels using a margin of 5% for the LCA 
calculation (Figure 3A). In contrast, classifications to the other 



lowest-level taxa were stable even using an LCA margin greater 
than 25%. The inability to classify sequences to the lowest levels 
indicates that the sequence may represent a new taxon branch- 
ing from the LCA node. For example, the sequences assigned 
to the Type II node had similar bit scores to both Methylocystis 
and pmoA2 clades and a NJ analysis of the sequences also sug- 
gested that it was new lineage (termed HY-4) at the root of the 
Type IPs (Figure 4). The assignments at the node of the Type Ila 
branch were the result of an inability to distinguish between the 
Methylocystis and Methylosinus-Methylocystis taxa, which might 
indicate intermediate sequence types or the existence of a bush- 
like continuum in this region of the tree (Luke and Frenzel, 
2011). 

In addition to biological diversity, there could also be techni- 
cal errors that impede classification, such as sequencing noise or 
chimerism. Furthermore, falsely assigned sequences in the taxon- 
omy file would also result in sequences failing to be classified at 
the lowest level. Both of these situations can generally be detected 
by visually analyzing the BLAST alignments, as described in the 
following section. 

7. 1.4. Lowest-level diversity: examination of hits and alignments 

The final level of novelty can be detected by examining the hits 
and alignments against the database. BLAST alignments of indi- 
vidual reads (Figure 3B) or summary alignments for groups of 
sequences (Figure 3C) can be examined in MEGAN. Examining 
individual reads gives an impression of how closely related a 
sequence is to members of the assigned taxon compared to the 
next-nearest taxon. In the example shown (Figure 3B), the top 
hit had 97% identity to a JRC_3 (702 bits) and the next best hit of 
only 89% identity (272 bits) to RPC_1. In this example, it is evi- 
dent that the sequence is genuinely JRC_3. In contrast, the hits to 
Methylocapsa had maximum identities of only 95% (not shown), 
suggesting that these might represent a closely related novel clus- 
ter. An analysis by NJ of these sequences indicated that indeed 
they formed a new branch close to Methylocapsa (termed HY-2) 
(Figure 4). 

The second option is to invoke the alignment of the hits to a 
taxon. Here it is possible to see evidence of novelty within a group 
of sequences classified to a particular taxon. For example, in some 
cases numerous conserved mismatches against the top database 
reference can suggest that the sequences belong to a divergent 
clade mostly related to the assigned taxon. For example, about 
40 conserved mismatches were present within the assignment to 
USCct in the HYa-1 dataset (Figure 3C). A closer analysis by NJ of 
these sequences indicated they were a novel lineage most closely 
related to USCct (termed HY-1) (Figure 4). 

8. DATA COMPARISONS AND DOWNSTREAM ANALYSIS 

MEGAN has several built-in functions that offer possibilities to 
visualize and analyze the results. Comparisons between sam- 
ples can be easily made using various visualization options. 
Trends in the data can be observed and demonstrated, such 
as the relative coverage of the two primer sets and the ele- 
vated abundance of RPCs in hollow soils (Figure 5). MEGAN 
can also calculate matrices of pairwise distances using six eco- 
logical measures: Euclidian, Goodall, Chi-Square, Kulczynski, 



www.frontiersin.org 



February 2014 | Volume 5 | Article 34 | 7 



Dumont et al. 



BLAST/LCA pmoA classification 



s=3 s=s s-B g=S $=* (R 
I 



. 3=3 

1 I! 



£ = .! 



=3 t=S 



1=1 g-l 1=1 1=1 1=1 1=1 1=1 



1=1 1=1 



3 pi 

1 11 



=1 5=3 



S=5 feg K E | |E| 



3 s=s a=s s=s i-§ 



t j t s t s t : s : t 
I : 5 : 5 : i : 8 S i 



ill! 



in ijuy 



i 

» 



4 - - 

JUUUUUUU 
L L , O O !^ M- !J r^; r_j 'Ji 'J 

C«< <rt -1: -i: 

SSSPSSSSBE 

juuuuuuuyu 

U U :j '_; :_! !j 

uuuuuuuuuu 

=" b- Er" H (=■ !=■ H It- H I- 



L -hhhhhhhhF' 

« -t. < < -i ■ ■ 

HHE-HHHHffH 



CD CD 

2 E 

*^ C 

© CD 

fE 

. "O *-> 
"(DM 

n +f o) 

T- U _Q 

O CD 

(M Q. © 

.. co JZ 

— E 

■ © ? 



> © >. 



CD CO 
ZS CD 

-o " 



c 

© 

o — 



; o 



._ ^ 

CD C 

C 5 

ft .2> 6 
,sr tn jz 



"O 

CD 
03 



al 5 

O o 



o 6 



oQ ? 9 



O 1 * 



-6 



o 



0) to O 



2 c 

o 
a 



E 0 

Z5 .£3 



~ n £ © 

u ! - 
to 



1 — ' 1 — O 

> a c v 

I O CT to 

a) cl To 2 

^ CD >~ C 

m— nj 

o o) E ' ' 



o 



_ E 2. 

5 => 0 

c C/5 ^ 

0 _ ° 

-S m £ 



^ CO 

0 6 

0 a: 



= 7. 0 
•<£ 

C w ° 
H) 0 

« £Z g 



03 



O — ■ 
g CO ro 

o 



6 « 
"5 S 



N 



0 E 

0 to 



— "O 

CO "~ 
3 - 

.£2 "o > 

> © CT C 

i.— tr Hi 



.£ -a 



_ = .£ 0 

Z p to 

rf c 0 cz 

S 0 0 o 

*■ 2 £ >■ 

o ^ 10 0 

t != g g 

e S x 

C C£ CD o 

"ceo. 

C O Q. 

0 .£ 0 

UJ t o ^ 

LL CO 1= 3 



X 
LU 



CO 



Frontiers in Microbiology | Terrestrial Microbiology 



February 2014 | Volume 5 | Article 34 | 8 



Dumont et al. 



BLAST/LCA pmoA classification 




FIGURE 4 1 Analysis of selected Type II pmoA sequences from the 
HYa-1 sample (Deng et al., 2013). The sequences chosen for 
analysis are color-coded in the partial MEGAN tree (A). The USCo and 
Mef/iy/ocapsa-assigned sequences were selected since the alignments 
showed conserved mismatches to the reference database (as shown 
for USCct sequences in Figure 3C). The sequences were imported 



into an ARB pmoA database, quality filtered by removing sequences 
with frameshifts, translated to amino acid sequences and added to 
the PmoA tree by parsimony and then reanalyzed by neighbor-joining. 
The positions of the sequences analyzed in ARB are shown (B). The 
new clades were named HY-1, HY-2 (Deng et al., 2013) and HY-4 (this 
study). 
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FIGURE 5 | MEGAN comparison view of pmoA classifications from the 
Riganqiao pmoA pyrosequencing datasets (Deng et al., 2013). The pmoA 
datasets were obtained from triplicate samples from hummock (HYa) and 
hollow (HYb) sites. PCRs were performed with two primer combinations 
(A189f/A682r or A189f/mb661 r), as indicated. The option to subsample 



datasets (3309 sequences) was chosen for the comparison. Assignments to 
internal nodes are not shown. MEGAN only shows taxa detected in at least 
one sample. The height of the bars was scaled to the number of reads 
assigned in each dataset and color-coded as indicated in the legend. The 
labeling at the top of the columns was added. 
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Bray-Curtis, and Hellinger (Mitra et al., 2010). Data can also be 
easily exported into statistical software programs. Of course, the 
classification of sequence data to taxa is only one step in the anal- 
ysis of HTS amplicon data of protein-coding genes and should be 
complemented by classification-independent analyses. 

9. CONCLUSIONS 

Although the naive Bayesian and BLAST/LCA methods provided 
similar classifications of the high-throughput pmoA sequence 
data examined in this study, the BLAST/LCA approach had 
several advantages, such as being less sensitive to false classifica- 
tion of contaminant sequences and offering several options for 
novelty detection at various levels of sequence divergence. The 
BLAST/LCA method has another advantage that a researcher can 
visually interpret the calculations, in the form of alignments, 
therefore enabling the results to be verified and judged. 

SUPPLEMENTARY MATERIAL 

The Supplementary Material for this article can be found 
online at: http://www.frontiersin.org/journal/10.3389/fmicb. 
2014.00034/abstract 
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